dplyr Data Manipulation

dplyr

Overview

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

mutate() adds new variables that are functions of existing variables.
select() picks variables based on their names.
filter() picks cases based on their values.
summarise() reduces multiple values down to a single summary.
arrange() changes the ordering of the rows.

These all combine naturally with group_by() which allows you to perform any operation “by group”. You can learn more about them in vignette("dplyr"). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette("two-table").
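
To illustrate what "by group" means, here is a minimal sketch on a made-up data frame (the `scores` data and its column names are assumptions for illustration, not part of the dplyr documentation). The same verb gives one overall result when ungrouped and one result per group after group_by().

```r
library(dplyr)

# Hypothetical toy data, invented for this example
scores <- data.frame(
  team  = c("a", "a", "b", "b", "b"),
  score = c(10, 20, 5, 15, 40),
  stringsAsFactors = FALSE
)

# Ungrouped: a single overall mean
overall <- scores %>% summarise(mean_score = mean(score))

# Grouped: one mean per team
by_team <- scores %>%
  group_by(team) %>%
  summarise(mean_score = mean(score))

# Grouped mutate: each score relative to the mean of its own team
centred <- scores %>%
  group_by(team) %>%
  mutate(diff = score - mean(score)) %>%
  ungroup()
```

Calling ungroup() at the end is a defensive habit, so that later steps do not silently keep operating per team.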

dplyr is designed to abstract over how the data is stored. That means as well as working with local data frames, you can also work with remote database tables, using exactly the same R code. Install the dbplyr package then read vignette("databases", package = "dbplyr").

If you are new to dplyr, the best place to start is the data transformation chapter in R for Data Science.

Installation

# The easiest way to get dplyr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just dplyr:
install.packages("dplyr")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("tidyverse/dplyr")

If you encounter a clear bug, please file a minimal reproducible example on GitHub. For questions and other discussion, please use the manipulatr mailing list.

Usage

library(dplyr)

# %>% is read as "and then"
starwars %>% filter(species == "Droid")
#> # A tibble: 5 x 13
#>   name  height  mass hair_color skin_color  eye_color birth_year gender
#>   <chr>  <int> <dbl> <chr>      <chr>       <chr>          <dbl> <chr>
#> 1 C-3PO    167   75. <NA>       gold        yellow          112. <NA>
#> 2 R2-D2     96   32. <NA>       white, blue red              33. <NA>
#> 3 R5-D4     97   32. <NA>       white, red  red              NA  <NA>
#> 4 IG-88    200  140. none       metal       red              15. none
#> 5 BB8       NA   NA  none       none        black            NA  none
#> # ... with 5 more variables: homeworld <chr>, species <chr>, films <list>,
#> #   vehicles <list>, starships <list>

starwars %>% select(name, ends_with("color"))
#> # A tibble: 87 x 4
#>   name           hair_color skin_color  eye_color
#>   <chr>          <chr>      <chr>       <chr>
#> 1 Luke Skywalker blond      fair        blue
#> 2 C-3PO          <NA>       gold        yellow
#> 3 R2-D2          <NA>       white, blue red
#> 4 Darth Vader    none       white       yellow
#> 5 Leia Organa    brown      light       brown
#> # ... with 82 more rows

starwars %>% mutate(name, bmi = mass / ((height / 100)  ^ 2)) %>% select(name:mass, bmi)
#> # A tibble: 87 x 4
#>   name           height  mass   bmi
#>   <chr>           <int> <dbl> <dbl>
#> 1 Luke Skywalker    172   77.  26.0
#> 2 C-3PO             167   75.  26.9
#> 3 R2-D2              96   32.  34.7
#> 4 Darth Vader       202  136.  33.3
#> 5 Leia Organa       150   49.  21.8
#> # ... with 82 more rows

starwars %>% arrange(desc(mass))
#> # A tibble: 87 x 13
#>   name    height  mass hair_color skin_color  eye_color  birth_year gender
#>   <chr>    <int> <dbl> <chr>      <chr>       <chr>           <dbl> <chr>
#> 1 Jabba …    175 1358. <NA>       green-tan,… orange          600.  herma…
#> 2 Grievo…    216  159. none       brown, whi… green, ye…       NA   male
#> 3 IG-88      200  140. none       metal       red              15.0 none
#> 4 Darth …    202  136. none       white       yellow           41.9 male
#> 5 Tarfful    234  136. brown      brown       blue             NA   male
#> # ... with 82 more rows, and 5 more variables: homeworld <chr>,
#> #   species <chr>, films <list>, vehicles <list>, starships <list>

# n = n() stores the number of rows in each group (here, each species) in a column named n
starwars %>% group_by(species) %>% summarise(
    n = n(),
    mass = mean(mass, na.rm = TRUE)
  ) %>% filter(n > 1)
#> # A tibble: 9 x 3
#>   species      n  mass
#>   <chr>    <int> <dbl>
#> 1 Droid        5  69.8
#> 2 Gungan       3  74.0
#> 3 Human       35  82.8
#> 4 Kaminoan     2  88.0
#> 5 Mirialan     2  53.1
#> # ... with 4 more rows

Some more examples:
# Two equivalent ways to apply both conditions
starwars %>% filter((species == "Droid") & (skin_color == "gold"))
starwars %>% filter(species == "Droid") %>% filter(skin_color == "gold")

which(starwars$species == "Droid") # base R: returns only the row indices (2 3 8 22 85)

starwars %>% select(name, mass, ends_with("year"))
starwars %>% mutate(name, bmi = mass / ((height / 100)  ^ 2)) %>% select(name:mass, bmi) %>% arrange(desc(bmi))

starwars %>% group_by(species) %>% summarise(n = n())

starwars %>% group_by(species) %>% summarise(H = mean(height))

starwars %>% group_by(species) %>% count()

nrow(starwars %>% filter(species == "Human")) # 35
starwars %>% filter(species == "Human") %>% count() # 35

length(unique(starwars$species)) # 38
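
Several of the counting idioms above are interchangeable. The sketch below uses the starwars data bundled with dplyr to check that; the object names (`by_summarise`, `by_group_count`, `by_count`) are made up for illustration, and exact counts are not shown because they can differ between dplyr versions.

```r
library(dplyr)

# Three ways to count rows per species
by_summarise   <- starwars %>% group_by(species) %>% summarise(n = n())
by_group_count <- starwars %>% group_by(species) %>% count() %>% ungroup()
by_count       <- starwars %>% count(species)   # shorthand for the two above

# n_distinct() matches length(unique()); NA is counted as a value by default
same_distinct <- n_distinct(starwars$species) == length(unique(starwars$species))

# mean() returns NA for any group with a missing height unless na.rm = TRUE
mean_height <- starwars %>%
  group_by(species) %>%
  summarise(H = mean(height, na.rm = TRUE))
```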

Functions in dplyr

Name	Description
all_vars	Apply predicate to all variables
compute	Force computation of a database query
distinct	Select distinct/unique rows
as.tbl_cube	Coerce an existing data structure into a tbl_cube
arrange	Arrange rows by variables
cumall	Cumulative versions of any, all, and mean
copy_to	Copy a local data frame to a remote src
auto_copy	Copy tables to same source, if necessary
filter	Return rows with matching conditions
filter_all	Filter within a selection of variables
do	Do anything
group_by_all	Group by a selection of variables
check_dbplyr	dbplyr compatibility functions
coalesce	Find first non-missing element
backend_dbplyr	Database and SQL generics.
explain	Explain details of a tbl
bind	Efficiently bind multiple data frames by row and column
all_equal	Flexible equality comparison for data frames
failwith	Fail with specified value.
add_rownames	Convert row names to an explicit variable.
case_when	A general vectorised if
group_by_prepare	Prepare for grouping.
group_indices	Group id.
join	Join two tbls together
ident	Flag a character vector as SQL identifiers
n	The number of observations in the current group.
location	Print the location in memory of a data frame
lead-lag	Lead and lag.
desc	Descending order
n_distinct	Efficiently count the number of unique values in a set of vectors
dim_desc	Describing dimensions
id	Compute a unique numeric id for each unique row in a data frame.
join.tbl_df	Join data frame tbls
order_by	A helper function for ordering window function output
na_if	Convert values to NA
band_members	Band membership
funs	Create a list of function calls.
bench_compare	Evaluate, compare, benchmark operations of a set of srcs.
recode	Recode values
reexports	Objects exported from other packages
group_size	Calculate group sizes.
progress_estimated	Progress bar with estimated time.
between	Do values in a numeric vector fall in specified range?
nasa	NASA spatio-temporal data
group_by	Group by one or more variables
tally_	Deprecated SE versions of main verbs.
select_all	Select and rename a selection of variables
select	Select/rename variables by name
sample	Sample n rows from a table
near	Compare two numeric vectors
if_else	Vectorised if
nth	Extract the first, last or nth value from a vector
init_logging	Enable internal logging
src_dbi	Source for database backends
rowwise	Group input by rows
src_local	A local source.
scoped	Operate on a selection of variables
dplyr-package	dplyr: a grammar of data manipulation
summarise_all	Summarise and mutate multiple columns.
summarise_each	Summarise and mutate multiple columns.
same_src	Figure out if two sources are the same (or two tbl have the same source)
dr_dplyr	Dr Dplyr checks your installation for common problems.
top_n	Select top (or bottom) n rows (by value)
tbl_vars	List variables provided by a tbl.
select_vars	Select variables
grouped_df	A grouped data frame.
storms	Storm tracks data
tidyeval	Tidy eval helpers
src_tbls	List all tbls provided by a source.
vars	Select variables
starwars	Starwars characters
summarise	Reduces multiple values down to a single value
with_order	Run a function with one order, translating result back to original order
tally	Count/tally observations by group
tbl	Create a table from a data source
tbl_cube	A data cube tbl
tbl_df	Create a data frame tbl.
groups	Return grouping variables
make_tbl	Create a "tbl" object
mutate	Add new variables
pull	Pull out a single variable
ranking	Windowed rank functions.
setops	Set operations
slice	Select rows by position
sql	SQL escaping.
src	Create a "src" object
common_by	Extract out common by variables
arrange_all	Arrange rows by a selection of variables
as.table.tbl_cube	Coerce a tbl_cube to other data structures

Vignettes of dplyr

Name
internals/hybrid-evaluation.Rmd
compatibility.Rmd
dplyr.Rmd
programming.Rmd
two-table.Rmd
window-functions.Rmd

Useful dplyr Functions

The R package dplyr is an extremely useful resource for data cleaning, manipulation, visualisation and analysis. 
It contains a large number of very useful functions and is, without doubt, one of my top 3 R packages today (ggplot2 and reshape2 being the others). 

The following verbs are commonly used in data manipulation tasks:

select() 
filter()
mutate() 
group_by() 
summarise()
arrange() 
left_join(), right_join(), inner_join(), full_join()

library(dplyr)

# Data file
file <- "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"

# Some sensible variable names
df_names <- c("age", "wrkclass", "fnlweight", "education_lvl", "edu_score",
 "marital_status", "occupation", "relationship", "ethnic", "gender",
 "cap_gain", "cap_loss", "hrs_wk", "nationality", "income")

# Import the data
df <- read.csv(file, header = FALSE,
 sep = ",",
 na.strings = c(" ?", " ", ""),
 row.names = NULL,
 col.names = df_names)

Many data manipulation tasks in dplyr can be performed with the assistance of the forward-pipe operator (%>%). 
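
As a small sketch of what the pipe does: x %>% f(y) is evaluated as f(x, y), so a nested call can be rewritten as a left-to-right chain. The vector `x` below is made up for illustration.

```r
library(dplyr)  # dplyr re-exports %>% from magrittr

x <- c(1, 4, 9)

# Nested call: read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: take x, then sqrt, then mean, then round
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```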

The first function I would like to introduce, distinct(), removes duplicate rows. Strictly speaking this is a preprocessing step rather than analysis, but it is useful enough to include here.
# Remove duplicate rows and check number of rows
df %>% distinct() %>% nrow()

# Drop duplicate rows and assign to new dataframe object
df_clean <- df %>% distinct()

# Drop duplicates based on one or more variables
df %>% distinct(gender, .keep_all = TRUE)
df %>% distinct(gender, education_lvl, .keep_all = TRUE)
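
The effect of .keep_all is easier to see on a tiny example. This is a sketch on made-up data (the `people` data frame is an assumption for illustration).

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 3 are exact duplicates
people <- data.frame(
  gender = c("F", "M", "F", "F"),
  age    = c(30, 40, 30, 50),
  stringsAsFactors = FALSE
)

# Whole-row duplicates removed: 3 rows remain
all_cols <- people %>% distinct()

# Distinct on one variable: only that column is returned by default
gender_only <- people %>% distinct(gender)

# .keep_all = TRUE keeps the other columns, taken from the first row of each value
first_rows <- people %>% distinct(gender, .keep_all = TRUE)
```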

Taking random samples of data is easy with dplyr.
# Sample random rows with or without replacement
# (size should be a whole number, hence floor())
sample_n(df, size = floor(nrow(df) * 0.7), replace = FALSE)
sample_n(df, size = 20, replace = TRUE)

# Sample a proportion of rows with or without replacement
sample_frac(df, size = 0.7, replace = FALSE)
sample_frac(df, size = 0.8, replace = TRUE)
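
Random samples change on every run unless the random seed is fixed. The sketch below, on a made-up `df_demo` data frame, shows a reproducible sample and one possible way to build a train/test split with anti_join(). Note that in dplyr 1.0.0 and later, sample_n() and sample_frac() are superseded by slice_sample(), although they still work.

```r
library(dplyr)

# Hypothetical toy data with a unique id per row
df_demo <- data.frame(id = 1:20)

# Fixing the seed makes the sample reproducible
set.seed(42)
s1 <- sample_n(df_demo, size = 3)
set.seed(42)
s2 <- sample_n(df_demo, size = 3)
identical(s1$id, s2$id)

# A simple split: sample 75% for training, keep the rest for testing
set.seed(1)
train <- sample_frac(df_demo, size = 0.75)
test  <- anti_join(df_demo, train, by = "id")
```

The split relies on `id` being unique; with duplicated keys anti_join() would drop every row sharing a sampled key.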

Renaming variables is also easy with dplyr.
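# Rename several variables at once (new_name = old_name); the result is not saved, so df is unchanged
df %>% rename(INCOME = income, AGE = age)

# Rename one variable and save the result (later examples use INCOME and the original age)
df <- df %>% rename(INCOME = income)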

The main “verbs” of dplyr are now introduced. 
Let’s begin with the select() verb, which subsets a data frame by column.
# Select specific columns (note that INCOME is the new name from earlier)
df %>% select(education_lvl, INCOME)
 
# Since dplyr 0.7.0, the pull() function extracts a single variable as a vector
df %>% pull(age)

# Drop a column using the - operator (variable can be referenced by name or column position)
df %>% select(-edu_score)
df %>% select(-1, -4)
df %>% select(-c(2:6))
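
The difference between select() and pull() is the type of the result. A minimal sketch on a made-up `toy` data frame:

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  a    = 1:3,
  b    = c("x", "y", "z"),
  flag = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

one_col <- toy %>% select(a)  # still a data frame, with one column
vec     <- toy %>% pull(a)    # a plain integer vector

is.data.frame(one_col)
is.data.frame(vec)

# Dropping a column by name or by position selects the same columns
by_name     <- toy %>% select(-b)
by_position <- toy %>% select(-2)
```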

Some useful helper functions are available in dplyr and can be used in conjunction with the select() verb. 
Here are some quick examples.
# Select columns with their names starting with "e"
df %>% select(starts_with("e"))

# The negative sign works for dropping here too
df %>% select(-starts_with("e"))

# Select columns with some pattern in the column name
df %>% select(contains("edu"))

# Reorder data to place a particular column at the start followed by all others using everything()
df %>% select(INCOME, everything())

# Select columns ending with a pattern
df %>% select(ends_with("e"))

df %>% select(ends_with("_loss"))
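
To see exactly which columns each helper matches, it can help to inspect only the resulting names. This sketch uses a one-row, made-up data frame whose column names loosely mirror those above.

```r
library(dplyr)

# Hypothetical one-row data frame; only the column names matter here
toy <- data.frame(edu_level = 1, edu_score = 2, cap_gain = 3, cap_loss = 4, age = 5)

starts  <- toy %>% select(starts_with("edu")) %>% names()
ends    <- toy %>% select(ends_with("_loss")) %>% names()
has_us  <- toy %>% select(contains("_")) %>% names()
reorder <- toy %>% select(age, everything()) %>% names()
dropped <- toy %>% select(-starts_with("cap")) %>% names()
```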

The next major verb we look at is filter() which, unsurprisingly, filters a data frame by row based on one or more conditions.
# Filter rows to retain observations where age is greater than 30
df %>% filter(age > 30)

# Match any of several values using the %in% operator (strings must match exactly, including the leading space)
df %>% filter(relationship %in% c(" Unmarried", " Wife"))

# You can also use the OR operator (|)
df %>% filter(relationship == " Husband" | relationship == " Wife")

# Filter using the AND operator
df %>% filter(age > 30 & INCOME == " >50K")

# Combine them too
df %>% filter(education_lvl %in% c(" Doctorate", " Masters") & age > 30)

# The NOT condition (filter out doctorate holders)
df %>% filter(education_lvl != " Doctorate")

# The grepl() function can be conveniently used with filter()
df %>% filter(grepl(" Wi", relationship))
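
Two details of filter() are easy to miss: rows where the condition evaluates to NA are dropped, and string comparisons are exact, so the leading spaces in this data set matter. A sketch on made-up data (the `toy` data frame is an assumption; trimws() is base R):

```r
library(dplyr)

# Hypothetical toy data with a missing age and space-prefixed strings
toy <- data.frame(
  name = c("a", "b", "c", "d"),
  age  = c(25, 35, NA, 45),
  rel  = c(" Wife", " Husband", " Wife", " Unmarried"),
  stringsAsFactors = FALSE
)

# Rows where the condition is NA are dropped, just like rows where it is FALSE
over_30 <- toy %>% filter(age > 30)

# Comma-separated conditions are combined with AND
both <- toy %>% filter(age > 30, rel %in% c(" Wife", " Unmarried"))

# Without the leading space nothing matches; trimming first fixes that
no_match <- toy %>% filter(rel == "Wife")
trimmed  <- toy %>% filter(trimws(rel) == "Wife")
```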

Next, we look at the summarise() verb, which collapses a data frame (or each group of a grouped data frame) into summary statistics. The result is an ordinary data frame, so it can be piped on to ggplot2 for visualisation.
# Summary statistics for the rows that pass the filter
df %>% filter(INCOME == " >50K") %>%
 summarise(mean_age = mean(age), median_age = median(age), sd_age = sd(age))

# Summarise multiple variables using summarise_at()
# (funs() is deprecated since dplyr 0.8.0; newer versions expect a list of functions or lambdas instead)
df %>% filter(INCOME == " >50K") %>%
 summarise_at(vars(age, hrs_wk), funs(n(), mean, median))

# We can also summarise with custom expressions
# The . is a placeholder for each selected variable in turn
df %>% summarise_at(vars(age, hrs_wk),
 funs(n(), missing = sum(is.na(.)), mean = mean(., na.rm = TRUE)))

# Create a new summary statistic with an anonymous function
df %>% summarise_at(vars(age),
 function(x) { sum((x - mean(x)) / sd(x)) })

# Summarise conditionally using summarise_if()
df %>% filter(INCOME == " >50K") %>% summarise_if(is.numeric, funs(n(), mean, median))
 
# Subset numeric variables and use summarise_all() to get summary statistics
ints <- df[sapply(df, is.numeric)]
summarise_all(ints, funs(mean, median, sd, var))
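
For readers on dplyr 1.0.0 or later, the scoped verbs above (summarise_at(), summarise_if(), summarise_all()) have a counterpart in across(). The following is a sketch of that alternative on made-up data, not a rewrite of the examples above; it assumes dplyr >= 1.0.0.

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for across() and where()

# Hypothetical toy data
toy <- data.frame(
  age    = c(20, 30, 40),
  hrs_wk = c(35, 40, 45),
  job     = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Counterpart of summarise_at(vars(age, hrs_wk), ...):
# a named list of functions yields columns named <variable>_<function>
at_style <- toy %>%
  summarise(across(c(age, hrs_wk), list(mean = mean, median = median)))

# Counterpart of summarise_if(is.numeric, ...)
if_style <- toy %>%
  summarise(across(where(is.numeric), mean))
```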

Next up is the arrange() verb, which sorts rows in ascending or descending order (ascending is the default).
# Sort by ascending age and print the first 10 rows
df %>% arrange(age) %>% head(10)

# Sort by descending age and print the first 10 rows
df %>% arrange(desc(age)) %>% head(10)
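
Two behaviours of arrange() worth knowing: missing values are sorted to the end in both directions, and additional arguments act as tie-breakers. A sketch on made-up data:

```r
library(dplyr)

# Hypothetical toy data with a missing age and a tie at age 30
toy <- data.frame(
  name = c("a", "b", "c", "d"),
  age  = c(30, NA, 25, 30),
  hrs  = c(40, 20, 35, 50),
  stringsAsFactors = FALSE
)

asc <- toy %>% arrange(age)        # ages: 25, 30, 30, NA
dsc <- toy %>% arrange(desc(age))  # ages: 30, 30, 25, NA (NA still last)

# Sort by descending age, breaking ties by ascending hours
ties <- toy %>% arrange(desc(age), hrs)
```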

The group_by() verb groups together observations that share a common value, so that subsequent verbs operate on each group separately.
# Summaries per group: one output row for each distinct value of the grouping variable
df %>% group_by(gender) %>% summarise(Mean = mean(age))
df %>% group_by(relationship) %>% summarise(total = n())
df %>% group_by(relationship) %>% summarise(total = n(), mean_age = mean(age))

The mutate() verb creates new variables from existing columns of the data frame or from objects in the global environment. 
New variables, such as sequences, can also be specified within mutate().
# Create new variables from existing or global variables
df %>% mutate(norm_age = (age - mean(age)) / sd(age))
 

# Multiply each numeric variable by 1000 (new columns get the suffix "_new" appended to the original name)
df %>% mutate_if(is.numeric, funs(new = (. * 1000))) %>% head()
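
A short sketch of the points made about mutate(), on made-up data: new columns can use existing columns, an object from the global environment, and even columns created earlier in the same call. The names `toy` and `offset` are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy data and a global object
toy <- data.frame(age = c(20, 30, 40))
offset <- 5

out <- toy %>% mutate(
  norm_age   = (age - mean(age)) / sd(age),  # standardised age (z-score)
  shifted    = age + offset,                 # uses the global object offset
  age_months = age * 12,
  back_again = age_months / 12,              # uses a column created just above
  row_id     = seq_along(age)                # a simple sequence
)
```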

The join verbs (there is no single join() function) combine columns from two tables by matching rows on a shared key, such as a primary key ID or some other common variable. 
There are many join variants but I will consider just left, right, inner and full joins.
# Create ID variable which will be used as the primary key
df <- df %>% mutate(ID = row_number()) %>% select(ID, everything())

# Create two tables (purposely overlap to facilitate joins)
table_1 <- df[1:50 , ] %>% select(ID, age, education_lvl)

table_2 <- df[26:75 , ] %>% select(ID, gender, INCOME)

# Left join keeps every row of table_1 and adds matching columns from table_2 (NA where there is no match)
left_join(table_1, table_2, by = "ID")

# Right join keeps every row of table_2 and adds matching columns from table_1
right_join(table_1, table_2, by = "ID")

# Inner join keeps only rows whose ID appears in both tables
inner_join(table_1, table_2, by = "ID")

# Full join keeps all rows from both tables
full_join(table_1, table_2, by = "ID")
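
The four joins differ mainly in which rows survive, which is easiest to check by counting rows on a small case. The sketch below uses two made-up tables (hypothetical values) that overlap on IDs 3 and 4, so it runs without downloading the data set.

```r
library(dplyr)

# Hypothetical toy tables: IDs 1-4 and IDs 3-6, overlapping on 3 and 4
table_1 <- data.frame(ID = 1:4, age = c(39, 50, 38, 53))
table_2 <- data.frame(ID = 3:6, gender = c("F", "M", "F", "M"),
                      stringsAsFactors = FALSE)

left  <- left_join(table_1, table_2, by = "ID")   # IDs 1-4, gender is NA for 1 and 2
right <- right_join(table_1, table_2, by = "ID")  # IDs 3-6, age is NA for 5 and 6
inner <- inner_join(table_1, table_2, by = "ID")  # IDs 3-4 only
full  <- full_join(table_1, table_2, by = "ID")   # IDs 1-6

sapply(list(left = left, right = right, inner = inner, full = full), nrow)
```

With the 50-row tables defined earlier (IDs 1 to 50 and 26 to 75), the same logic gives 50, 50, 25 and 75 rows respectively.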

That wraps up a brief demonstration of some of dplyr’s excellent functions. 
For additional information on the functions and their arguments, check out the help documentation using the template ?function_name (for example, ?mutate).